WIR 2009 Workshop Information Retrieval 2009 Editors

Authors

  • Ingo Frommholz
  • Thomas Mandl
Abstract

Table extraction is the task of locating tables in a document and extracting their content along with its arrangement within the tables. The notion of tables applied in this work excludes any sort of metadata, i.e. only the entries of the tables are to be extracted. We follow a simple unsupervised approach that selects tables according to a score measuring in-column consistency as the pairwise similarity of entries, where separator columns are also taken into account. Since the average similarity is less reliable for smaller tables, this score requires a levelling in favor of larger tables, for which we make several proposals that are compared in experiments on a test set of HTML documents. To reduce the number of candidate tables, we use assumptions on the entry borders in terms of markup tags. These assumptions hold for only part of the test set, but they allow us to evaluate any potential table without referring to the HTML syntax. The experiments indicate that the discriminative power of the in-column similarities is limited, but still considerable given the simplicity of the applied similarity functions.
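The scoring idea in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `shape`-based similarity, the function names, and the damping factor `n / (n + k)` used as a levelling in favor of larger tables. Separator columns, which the abstract says are also taken into account, are left out.

```python
from itertools import combinations


def shape(entry: str) -> str:
    """Collapse an entry into a coarse character-class pattern, e.g. '12.5' -> '9.9'."""
    out = []
    for ch in entry.strip():
        # Digits become '9', letters become 'a', anything else is kept as-is.
        cls = "9" if ch.isdigit() else "a" if ch.isalpha() else ch
        if not out or out[-1] != cls:  # collapse runs of the same class
            out.append(cls)
    return "".join(out)


def entry_similarity(a: str, b: str) -> float:
    """Toy similarity (assumption): 1.0 if both entries share a shape, else 0.0."""
    return 1.0 if shape(a) == shape(b) else 0.0


def table_score(rows: list[list[str]], k: float = 5.0) -> float:
    """Mean in-column pairwise similarity, damped for small tables.

    The factor n / (n + k) is a hypothetical levelling choice, where n is
    the number of compared entry pairs.
    """
    sims = [
        entry_similarity(a, b)
        for column in zip(*rows)            # iterate over columns
        for a, b in combinations(column, 2)  # all entry pairs within a column
    ]
    if not sims:
        return 0.0
    n = len(sims)
    return (sum(sims) / n) * (n / (n + k))


# A candidate whose columns are internally consistent, and one where they are not.
consistent = [["Anna", "12.5"], ["Bob", "7.25"], ["Carl", "3.0"]]
mixed = [["Anna", "12.5"], ["7.25", "Bob"], ["Carl", "n/a"]]
```

With this sketch, `table_score(consistent)` is higher than `table_score(mixed)`, and a larger table with the same perfect consistency scores higher than a smaller one, which is the intended effect of the levelling.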


Similar resources

SIGIR WORKSHOP REPORT Information Retrieval and Advertising

The Information Retrieval and Advertising Workshop (IRA 2009) was held on July 23, 2009 in Boston, Massachusetts, in conjunction with the 32nd Annual ACM SIGIR Conference. The workshop covered theoretical and empirical issues in several research areas that span the intersection of computational advertising, information retrieval, and economics. The workshop consisted of 3 invited talks, 6 refer...



Flache und semantische Verarbeitung von Namen biochemischer Verbindungen

Term processing in the life sciences domain poses a number of challenges for information retrieval, data mining, information extraction, and the maintenance of scientific databases. We describe these problems and present our two approaches to solving them. These are, on the one hand, a normalized name matching and, on the other hand, a semantic N...


CIKM WORKSHOP REPORT Workshop on Large-Scale Distributed Systems for Information Retrieval

Due to the dramatically increasing amount of available data, effective and scalable solutions for data organization and search are essential. Distributed solutions naturally provide promising alternatives to standard centralized approaches. With the computational power of thousands or millions of computers in clusters or peer-to-peer systems, the challenges that arise are manifold, ranging from...


Workshop on Contextual Information Access, Seeking and Retrieval Evaluation

There are three parts to this talk, related in rather tangential ways. First, I will give a recap of an argument developed in a couple of earlier talks, at IIiX in 2008 and at the SIGIR evaluation workshop in 2009. The gist of the argument is about thinking about IR as a science, and the consequences ... Appears in the Proceedings of The 2nd International Workshop on Contextual Information Access...




Publication date: 2009